Locality-Sensitive Hashing

24 July 2024 (updated 24 July 2024)

$$ \gdef\S{\mathrm{S}} \gdef\darc{d_{\mathrm{arc}}} \gdef\a#1#2{\mathrm{A}_{#1}\p{#2}} \gdef\sign#1{\mathrm{sign}\p{#1}} $$

Let $\S_n$ be an $n$-dimensional unit $n$-sphere

$$ \begin{align} \mathrm{S}_n = \setb{\vec x ∈ \R^n}{\norm{\vec x}_2 = 1} \end{align} $$

and let $\darc$ be the arc length distance

$$ \begin{align} \darc\p{\vec a, \vec b} = \cos^{-1}\p{\vec a ⋅ \vec b} \end{align} $$

Random projections

$\mathcal S$ The uniform distribution on $\S^n$.
$\mathcal F$ The distribution by constructing.
$n$ The number of dimensions.
$m$ The number of hyperplanes in a vector.
$k$ The number of vectors.

Hyperplanes are defined by a normal unit vector $\vec h ∈ \S^n$. For a given vector $\vec x$ we can determine on which side of the plane it lies by the sign of the dot product $\vec x ⋅ \vec h$.

$$ \begin{align} \mathrm{Pr}_{\vec h ∈ \mathcal S}\left[\sign{\vec h ⋅ \vec a} = \sign{\vec h ⋅ \vec b} \right] = 1 - \frac{\darc\p{\vec a, \vec b}}{π} \end{align} $$

If we pick $m$ random hyperplanes $\vec h_i ∈ \S^n$ we can map each vector $\vec x ∈ \S^n$ into a $m$-bit vector $b(\vec x)$:

$$ \begin{align} b_i\p{\vec x} = \begin{cases} 0 & \vec h_i ⋅ \vec x ≤ 0 \\ 1 & \vec h_i ⋅ \vec x > 0 \end{cases} \end{align} $$

Assuming the hyperplanes are uniformly random, the probability of two bitvectors being equal is:

$$ \begin{align} \Pr{b\p{\vec a} = b\p{\vec b}} = \p{1 - \frac{\darc\p{\vec a, \vec b}}{π} }^m \end{align} $$

The hyperplanes form a tesselation of $\S^n$ where each region has a unique bitvector and each boundary represents a single bit change. Interestingly this implies that a Hamiltonian path through regions (if one exists) creates a Gray code.

We have two distributions on $\S^n$, $\mathcal{U}_{match}$ and $\mathcal{U}_{non-match}$. Given a vector $\vec x$ we want to classify it's distribution.

Assumption. Iris embeddings of different irises are uniformly distributed on $\S^n$ so the $\darc$ of non-matching irises follows the formula.

$$ \begin{align} \Pr{b\p{\vec a} = b\p{\vec b}} = \p{1 - \frac{\darc\p{\vec a, \vec b}}{π} }^m \end{align} $$

Amplification

What is the probability of a match if $\darc < T$.

We are a looking for a family of hash functions $\mathcal H$, $h: \S^n → H$ such that

$∀_{\vec a, \vec b ∈ \S^n}\, \darc\p{\vec a, \vec b} < T → \mathrm{Pr}_{h ∈ \mathcal H}\left[h\p{\vec a} = h\p{\vec b} \right] ≥ P_1$
$∀_{\vec a \S^n}\, \mathrm{Pr}_{h ∈ \mathcal H}\left[h\p{\vec a} = h\p{\vec b} \right] ≥ P_1$

Irreversibility

Assume we are given $H\p{\vec a}$, what can we learn about $\vec a$?

We can at best hope to learn all of the hyper-plane signs. This is $m⋅k$ bits of information. The subset of $S^n$ that has the same signs is a convex polygon with an (assumed) expected area of $2^{-m⋅k} A_n$. We would not have any more information than this, so our posterior would at best be the uniform distribution over this region.

We can start by drawing many samples from $\mathcal S$. and find matching vectors $k$.

TODO https://www.notion.so/Draft-Outline-99ecafcd45384d58a06bc1003fe1088b

https://en.wikipedia.org/wiki/Locality-sensitive_hashing

https://www.cs.princeton.edu/courses/archive/spring04/cos598B/bib/CharikarEstim.pdf

https://arxiv.org/abs/2010.09393

https://randorithms.com/2019/09/19/Visual-LSH.html

https://arxiv.org/pdf/2010.09393.pdf

https://arxiv.org/pdf/1901.09769.pdf